Stereoelectroencephalography (SEEG) is a technique used in patients with drug-resistant epilepsy who may be candidates for surgical resection of the epileptogenic zone. Multiple electrodes are placed using a so-called "frame-based" stereotactic approach, in our case using the Leksell frame. In our previous paper, "Methodology, outcome, safety and in vivo accuracy in traditional frame-based stereoelectroencephalography" by Van der Loo et al (2017), we reported on SEEG electrode implantation accuracy in a cohort of 71 patients operated on between September 2008 and April 2016, in whom a total of 902 electrodes were implanted. Data for in vivo application accuracy analysis were available for 866 electrodes.
The goal of the current project is to use a public version of this dataset (without any personal identifiers) to predict electrode implantation accuracy, using and comparing different machine learning approaches.
Pieter Kubben, MD, PhD
neurosurgeon @ Maastricht University Medical Center, The Netherlands
The public dataset contains the following variables: PatientPosition, Contacts, ElectrodeType, ScrewLength, the planned target coordinates (PlanningX, PlanningY, PlanningZ), the entry point coordinates (EntryX, EntryY, EntryZ), the actual electrode tip coordinates (TipX, TipY, TipZ), PlanningRing, PlanningArc, DuraTipDistancePlanned, SkinSkullDistance, SkullThickness and SkullAngle.
The electrodes are the Microdeep depth electrodes by DIXI Medical.
To the limited extent possible in this case, I tried to make these FAIR data and adhere to the FAIR guiding principles. In practice this meant introducing the topic, describing the data and creating a DOI.
Now let's get started.
In [1]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
%matplotlib inline
import warnings; warnings.simplefilter('ignore')
%xmode plain; # shorter error messages
# global setting whether to save figures or not
# will save as 300 dpi PNG - all filenames start with "fig_"
save_figures = False
In [2]:
# load data
electrodes = pd.read_csv('../data/electrodes_public.csv')
# find missing values
nan_rows = sum([True for idx,row in electrodes.iterrows() if any(row.isnull())])
print('Nr of rows with missing values:', nan_rows)
We will calculate the target point localization error (TPLE) as the Euclidean distance between the planned target and the actual electrode tip, and remove the entry data as we won't be using them (we still wanted to share them in the dataset though).
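For reference, with planned target coordinates $(x_p, y_p, z_p)$ and actual tip coordinates $(x_t, y_t, z_t)$, the TPLE computed below is simply

$$\mathrm{TPLE} = \sqrt{(x_t - x_p)^2 + (y_t - y_p)^2 + (z_t - z_p)^2}$$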
In [3]:
# calculate TPLE and remove entry data from dataframe
electrodes['TPLE'] = np.sqrt(np.square(electrodes['TipX'] - electrodes['PlanningX']) +
np.square(electrodes['TipY'] - electrodes['PlanningY']) +
np.square(electrodes['TipZ'] - electrodes['PlanningZ'])
).round(1)
electrodes.drop(['EntryX', 'EntryY', 'EntryZ'], axis = 1, inplace = True)
electrodes.head()
Out[3]:
Now we will remove large outliers in the Z-axis (difference between planned and actual coordinate), as electrode insertion length (depth) is also influenced by other factors: depth calculations that can lead to a position that is either too superficial or too deep, but also possible malfixation of the screw cap, which may cause loosening of the electrode and hence a more superficial position (it won't migrate deeper spontaneously). These outliers are very few in number but would influence further analysis too strongly.
In [4]:
# check for outliers in Z axis
large_depth_error = electrodes[np.abs(electrodes['PlanningZ'] - electrodes['TipZ']) > 10]
print('Outliers in Z axis (> 10mm):\n\n', large_depth_error['TPLE'])
# remove outliers
electrodes.drop(large_depth_error.index, inplace = True)
print('\nNew dataframe shape:', electrodes.shape) # removed 6 rows
We need to structure our data properly for further analysis and convert the categorical variables (nominal, ordinal) to the category type.
In [5]:
# convert categorical columns to "category" dtype
catcols = ['PatientPosition', 'Contacts', 'ElectrodeType', 'ScrewLength']
for cat in catcols:
    electrodes[cat] = electrodes[cat].astype('category')
# confirm correct types for all columns now
electrodes.dtypes
Out[5]:
Let's get a short description of our TPLE data.
In [6]:
# get summary data on TPLE
tple = electrodes['TPLE']
tple.describe().round(1)
Out[6]:
We now have data in the right format, but for classification we need to bin the continuous outcome variable TPLE into categories. Alternatively we could approach this as a regression problem, but given the relatively limited amount of data, classification should lead to a better prediction model that is still relevant for potential clinical use.
Let's create a new variable TPLE category for this purpose.
In [7]:
# create different possible cuts to create categories (and experiment with them)
tple_max = tple.max().round()
electrodes_3cat = pd.cut(tple, bins = [0, 2.5, 5, tple_max], labels = ['0 - 2.5', '2.5 - 5', '> 5'])
electrodes_4cat = pd.cut(tple, bins = [0, 2, 4, 6, tple_max], labels = ['0 - 2', '2 - 4', '4 - 6', '> 6'])
electrodes_4quant = pd.cut(tple, bins = [0, tple.quantile(.25), tple.median(), tple.quantile(.75), tple_max], labels = ['0 - 2', '2 - 3', '3 - 4', '> 4'])
electrodes_5cat = pd.cut(tple, bins = [0, 1, 2, 5, 10, tple_max], labels = ['0 - 1', '1 - 2', '2 - 5', '5 - 10', '> 10'])
electrodes_7cat = pd.cut(tple, bins = [0, 1, 2, 4, 6, 8, 10, tple_max], labels = ['0 - 1', '1 - 2', '2 - 4', '4 - 6', '6 - 8', '8 - 10', '> 10'])
# apply cut to create TPLE category column
electrodes['TPLE category'] = electrodes_5cat
nr_of_y_categories = len(electrodes['TPLE category'].unique()) # needed for confusion matrix and keras MLP
# check correct conversion with first 15 rows
electrodes[['TPLE', 'TPLE category']].head(15).T
Out[7]:
In [8]:
# count nr of items in each category
electrodes[['TPLE', 'TPLE category']].groupby('TPLE category').count().T
Out[8]:
So we have a lot of electrodes in the 2-5mm TPLE category, which is well explained by an interquartile range of 2.0 - 4.1. Let's create some plots to learn more about our data.
In [9]:
# plot TPLE distribution
tple.plot(kind = 'density', figsize = (15,5), title = 'TPLE density plot');
plt.xlim(0,10);
if save_figures:
    plt.savefig('fig_tple_density_plot.png', dpi = 300)
In [10]:
# plot variable distributions (density)
params = {'subplots': True, 'layout': (3,4), 'sharex': False, 'figsize': (16, 12)}
elec_visual = electrodes.drop(['TPLE'], axis = 1)
elec_visual.plot(kind='density', **params);
if save_figures:
    plt.savefig('fig_variables_density_plot.png', dpi = 300)
In [11]:
# alternatively, make a box plot
elec_visual.plot(kind='box', sharey = False, **params);
if save_figures:
    plt.savefig('fig_variables_boxplot.png', dpi = 300)
Now we have a visual impression of how the variables are distributed. As we want to develop a model to predict TPLE category, it would be helpful to see to what extent the variable distributions differ per TPLE category.
In [12]:
# first relate numeric variables to TPLE category
numcols = list(electrodes.columns[electrodes.dtypes == float].drop('TPLE'))
fig, axes = plt.subplots(4,3, figsize = (15,20))
fig.subplots_adjust(hspace=.5)
electrodes.boxplot(column = numcols, by = 'TPLE category', ax = axes);
if save_figures:
    plt.savefig('fig_tplecompare_boxplot.png', dpi = 300)
In contrast to numerical variables, where we can plot categories against actual feature values, for categorical columns we can "only" plot categories against the number of items in each category. This is what we will do next.
Note that these are absolute counts, not ratios (e.g. the supine position is used most often, which is also reflected below). So the point is to inspect the graphs for any remarkable differences in ratios between categories, not simply to look at the highest bars.
In [13]:
# show absolute TPLE count per category
elec_count = 'Electrode count'
for cat in catcols:
    cat_count = electrodes[[cat, 'TPLE','TPLE category']].groupby([cat,'TPLE category']).count()
    cat_count.rename_axis({'TPLE': elec_count}, axis = 'columns', inplace = True)
    sns.factorplot(x = cat, y = elec_count, data = cat_count.reset_index(), kind = 'bar', hue = 'TPLE category',
                   size = 5, aspect = 3);
    if save_figures:
        plt.savefig('fig_tplecompare_cat_{}.png'.format(cat), dpi = 300)
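Since absolute counts can hide exactly those ratio differences, a quick sketch of the within-group proportions (using pd.crosstab on the same electrodes frame; not part of the original analysis) could look like this:

# express the TPLE category distribution within each group as proportions instead of absolute counts
for cat in catcols:
    cat_ratio = pd.crosstab(electrodes[cat], electrodes['TPLE category'], normalize = 'index')
    print('\nTPLE category proportions per {}:\n'.format(cat), cat_ratio.round(2))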
Now, how do the variables correlate with each other? To be able to compute correlations we first have to deal with missing data. We will create a temporary solution here for visualisation and use another (recommended for that purpose) approach for predictive modelling.
In [14]:
# copy dataframe to temporary frame in which we'll impute missing values
elec_pairplot = electrodes
# create a utility function to check if we (still) have missing values present in our data
def print_missing(dataframe):
    '''checks which columns contain missing values and prints a count per column'''
    df_na = dataframe.columns[dataframe.isnull().any()]
    if len(df_na) > 0:
        print('Missing data present in {} columns: \n'.format(len(df_na)))
        for c in df_na:
            print('- {} ({})'.format(c, dataframe[c].isnull().sum()))
    else:
        print('No missing data found! :-)')
print_missing(elec_pairplot)
In [15]:
# use median values for numeric columns
for missing in ['PlanningRing', 'PlanningArc', 'DuraTipDistancePlanned']:
    elec_pairplot[missing] = elec_pairplot[missing].fillna(elec_pairplot[missing].median())
# use most frequent value for categorical columns
for missing in ['Contacts', 'ScrewLength']:
    most_frequent_value = elec_pairplot[missing].value_counts().index[0]
    print('Most frequent value in category "{}": {}'.format(missing, most_frequent_value))
    elec_pairplot[missing] = elec_pairplot[missing].fillna(most_frequent_value)
print_missing(elec_pairplot)
From practical experience I can confirm that we use 18 contact points frequently, so it makes sense to use that value here (only 56 missing... a more advanced approach would be to correlate with other variables and then decide). Regarding screw length, 25mm is used by far the most, so despite the fact that most values are missing (534 / 860) I still kept them in (in the original paper I referred to, increasing screw length seems to correspond with increasing TPLE).
Now let's look at some correlations.
In [16]:
# pairplot to correlate variable distributions per category.. busy plot
sns.pairplot(elec_pairplot[['PatientPosition', 'Contacts', 'ElectrodeType', 'DuraTipDistancePlanned',
'SkinSkullDistance', 'SkullThickness', 'SkullAngle', 'ScrewLength', 'TPLE category']],
hue = 'TPLE category');
if save_figures:
    plt.savefig('fig_electrodes_corr_pairplot.png', dpi = 300)
Those are not exactly easily separable clusters... looks bad for further predictive modelling.... :-(
Alternatively we will use Pearson's correlations to create a "heatmap". To include categorical variables we need to "one hot encode" them first (using pd.get_dummies()).
In [17]:
plt.figure(figsize = (15,12));
electrodes_dummies = pd.get_dummies(electrodes.drop(['TPLE category'], axis = 1))
# correlations for continuous variables only
# sns.heatmap(electrodes.corr(), square = True, annot = True, linewidths = .5, cmap = 'RdBu_r');
# correlations including categorical variables
sns.heatmap(electrodes_dummies.corr(), square = True, annot = False, cmap = 'RdBu_r');
if save_figures:
    plt.tight_layout() # needed to prevent cutting X-axis labels
    plt.savefig('fig_electrodes_corr_heatmap.png', dpi = 300)
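To focus on the outcome specifically, we could also sort the correlations of all (dummy-encoded) variables with TPLE itself; a quick sketch using the electrodes_dummies frame from the cell above:

# correlation of every (dummy-encoded) variable with TPLE, from most negative to most positive
tple_corr = electrodes_dummies.corr()['TPLE'].drop('TPLE').sort_values()
print(tple_corr.round(2))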
We will now use several machine learning classification approaches (manual, auto ML and a quick glance at deep learning) to predict TPLE category. I tried several feature selection and dimensionality reduction approaches, which I left in place as comments as they did not contribute to better results.
In [30]:
from sklearn.preprocessing import LabelEncoder, PolynomialFeatures, Imputer, MinMaxScaler
from sklearn.feature_selection import SelectKBest, chi2, VarianceThreshold
elec_features = electrodes.drop(['TPLE', 'TPLE category'], axis = 1)
# encode categorical features (no one hot encoding to avoid creating too many features)
for cat in catcols:
    elec_features[cat] = LabelEncoder().fit_transform(elec_features[cat])
X = elec_features # to start with
# deal with missing values the sklearn way (and do not impute if not needed to)
X = Imputer(strategy = 'most_frequent').fit_transform(X)
# feature selection - none of these actually improved accuracy (not tried for all other outcomes)
# X = elec_features[['ScrewLength', 'PatientPosition', 'ElectrodeType', 'Contacts', 'SkullAngle']]
# X = elec_features['SkullAngle'][:, np.newaxis]
# X = MinMaxScaler().fit_transform(X) # is it correct to do this on whole dataset?
# X = VarianceThreshold(threshold=(.8 * (1 - .8))).fit_transform(X)
# X = PolynomialFeatures(2, interaction_only = True).fit_transform(X)
# X = SelectKBest(chi2, k = 5).fit_transform(X, y)
# encode outcome (separate encoder, want to make sure to be able to reverse later)
le = LabelEncoder()
y = le.fit_transform(electrodes['TPLE category'])
In [31]:
# dimensionality reduction using PCA
from sklearn.decomposition import PCA
pca = PCA(n_components = X.shape[1])
X_pca = pca.fit_transform(X)
# plot PCA variance ratio
plt.figure(figsize = (10,6));
plt.plot(pca.explained_variance_ratio_.cumsum());
plt.title('Explained variance ratio by PCA components');
plt.xlabel('PCA component');
plt.ylabel('Cumulative variance ratio');
if save_figures:
    plt.savefig('fig_mlmanual_pca.png', dpi = 300)
In [32]:
# use 5 principal components... (explains > 95% of variance) even slightly reduces accuracy...
# X = PCA(5).fit_transform(X)
Now split the data into train and test sets (maybe I should have done this earlier, e.g. using a Pipeline, for future predictions, but for now I'll leave it as is).
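For reference, a leakage-safe setup could wrap imputation and scaling in a Pipeline, so they are fitted on the training folds only during cross-validation. A minimal sketch (assuming the same X and y as above; not what is used in the rest of this notebook):

# minimal sketch: preprocessing fitted inside each CV fold instead of on the full dataset
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer, MinMaxScaler
from sklearn.linear_model import LogisticRegression
pipe = Pipeline([('impute', Imputer(strategy = 'most_frequent')),
                 ('scale', MinMaxScaler()),
                 ('clf', LogisticRegression())])
# cross_val_score(pipe, X, y, cv = 10)  # imputation and scaling are then refitted within every fold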
In [33]:
# split data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 33, stratify = y)
In [39]:
# import modules
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from xgboost import XGBClassifier
# prepare models
models = {'LOG': LogisticRegression(),
'LDA': LinearDiscriminantAnalysis(),
'KNN': KNeighborsClassifier(),
'CART': DecisionTreeClassifier(),
'NB': GaussianNB(),
'SVM': SVC(),
'LSVM1': LinearSVC(penalty='l1', dual = False),
# 'LSVM2': LinearSVC(),
'RF': RandomForestClassifier(),
'ADA': AdaBoostClassifier(),
'XGB': XGBClassifier()
}
Everything is set up. We will now evaluate our models for accuracy and plot the results.
In [40]:
# evaluate each model in turn
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report
names = []; results = []
print('MODEL: ACCURACY (STD)\n')
for name, model in models.items():
    cv_results = cross_val_score(model, X_train, y_train, cv = KFold(10), scoring = 'accuracy')
    names.append(name)
    results.append(cv_results)
    print('{}: {:.2f} ({:.2f})'.format(name, cv_results.mean(), cv_results.std()))
    # print(classification_report(y_test, model.fit(X_train, y_train).predict(X_test), target_names = le.classes_))
In [43]:
# boxplot algorithm comparison
fig = plt.figure(figsize = (12,6))
plt.title('Accuracy comparison for different models')
ax = fig.add_subplot(111)
ax.set_ylim(0.3, 0.7)
plt.boxplot(results)
ax.set_xticklabels(names); # set xtick labels after plotting! (otherwise default values override custom labels)
if save_figures:
    plt.savefig('fig_mlmanual_boxplot.png', dpi = 300)
In [84]:
# create confusion matrix to compare true labels and predicted labels
from sklearn.metrics import confusion_matrix
model = LogisticRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
conf_mat_labels = sorted(list(electrodes['TPLE category'].unique()))
conf_mat = confusion_matrix(y_test, y_pred).T
# conf_mat = confusion_matrix(y_test, y_pred, labels = conf_mat_labels)
plt.figure(figsize = (7, 7));
plt.title('Confusion matrix for model predictions')
sns.heatmap(conf_mat, annot = True, fmt = 'd', cbar = False, square = True, cmap = 'Purples', linewidth = .5,
xticklabels = conf_mat_labels, yticklabels = conf_mat_labels);
plt.xlabel('True TPLE category');
plt.ylabel('Predicted TPLE category');
if save_figures:
    plt.savefig('fig_mlmanual_confusionmatrix.png', dpi = 300)
Hmm... TPLE category 2-5 is overrepresented... how many samples of our test set do we actually have in each category?
In [85]:
print('Nr of values in each category for y_test:\n')
y_test_unique = np.unique(le.inverse_transform(y_test), return_counts=True)
print('Unique values:', y_test_unique[0])
print('Unique values count:', y_test_unique[1])
I am starting to doubt whether accuracy is the best outcome metric. Let's get some other metrics too for the model used for the confusion matrix above.
In [96]:
from sklearn.metrics import accuracy_score, classification_report
print('Evaluation for:\n\n', model)
# accuracy score
print('\n\nAccuracy score:', accuracy_score(y_test, y_pred).round(2))
print('\n\nClassification report: \n\n', classification_report(y_test, y_pred, target_names = conf_mat_labels))
# cross val score
cvscore = cross_val_score(model, X_test, y_test, scoring = 'accuracy', cv = 10)
print('\n\nCross val score (std): {:.2f} ({:.2f})'.format(cvscore.mean(), cvscore.std())) # differs from the accuracy score above because the model is refitted and scored on folds of the test set
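Plain accuracy treats the dominant 2-5 category the same as the rare ones; chance-corrected or class-balanced metrics may be more informative here. A quick sketch on the same y_test / y_pred (Cohen's kappa and macro-averaged F1 are standard scikit-learn metrics, not something reported in the original paper):

# chance-corrected agreement and class-balanced F1 as alternatives to plain accuracy
from sklearn.metrics import cohen_kappa_score, f1_score
print("Cohen's kappa: {:.2f}".format(cohen_kappa_score(y_test, y_pred)))
print('Macro-averaged F1: {:.2f}'.format(f1_score(y_test, y_pred, average = 'macro')))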
Let's say there is room for improvement... hyperparameter tuning may help; we will do so using a cross-validated parameter grid search for SVM and XGB.
In [28]:
# SVC tuning
from sklearn.model_selection import GridSearchCV
# SVC().get_params()
svm_param_grid = [
{'C': [1, 10, 100, 1000], 'kernel': ['linear']},
{'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]
svm_param_grid = {'kernel': ['linear', 'rbf'], 'C': [1, 10]} # simple and less slow...
svm_grid = GridSearchCV(SVC(), svm_param_grid, cv = 3, scoring = 'accuracy', n_jobs = -1)
%time svm_grid.fit(X_train, y_train)
Out[28]:
In [29]:
print('Best SVM params:', svm_grid.best_params_)
print('Best SVM score:', svm_grid.best_score_.round(2))
print('Best SVM model:\n', svm_grid.best_estimator_)
In [30]:
# XGB tuning
# XGBClassifier().get_params()
xgb_param_grid = {'learning_rate': [0.001, 0.01, 0.1, 1.0],
'max_depth': [3, 5, 7],
'n_estimators': [100, 150, 200]
}
xgb_grid = GridSearchCV(XGBClassifier(), xgb_param_grid, cv = 3, scoring = 'accuracy', n_jobs = -1)
%time xgb_grid.fit(X_train, y_train)
Out[30]:
In [31]:
print('Best XGB params:', xgb_grid.best_params_)
print('Best XGB score:', xgb_grid.best_score_.round(2))
print('Best XGB model:\n', xgb_grid.best_estimator_)
In [32]:
# plot XGB tree
from xgboost import plot_tree
fig, axes = plt.subplots(figsize = (15,7));
plot_tree(xgb_grid.best_estimator_, ax = axes, rankdir='LR');
if save_figures:
    plt.savefig('fig_mlmanual__xgbtree.png', dpi = 300); # will still look like shit
Let's try an automated ML approach using TPOT by Randal Olson for this purpose... this can take a while (12-24h). A short test can be run using TPOTClassifier(generations=5, population_size=50, verbosity=2).
In [33]:
# from tpot import TPOTClassifier
# tpot = TPOTClassifier(generations=50, population_size=50, verbosity=2, n_jobs = -1)
# tpot.fit(X_train, y_train)
# print('\nTPOT score: ', tpot.score(X_test, y_test))
# tpot.export('tpot_tple_classification.py')
The exported file tpot_tple_classification.py contains the full pipeline from the fitted model. We'll reconstruct that pipeline below so we can use it without having to re-run the full TPOT process.
In [54]:
# construct TPOT fitted pipeline as model
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import FastICA
tpot_model = make_pipeline(
VarianceThreshold(threshold=0.4),
FastICA(tol=0.1),
RandomForestClassifier(bootstrap=False, criterion="gini", max_features=0.3, min_samples_leaf=1, min_samples_split=10, n_estimators=100)
).fit(X_train, y_train)
y_pred_tpot = tpot_model.predict(X_test)
In [61]:
plt.figure(figsize = (7, 7));
plt.title('Confusion matrix for TPOT predictions')
sns.heatmap(confusion_matrix(y_test, tpot_model.predict(X_test)).T, cbar = False, annot = True, fmt = 'd',
square = True, cmap = 'Oranges', linewidth = .5,
xticklabels = conf_mat_labels, yticklabels = conf_mat_labels);
plt.xlabel('True TPLE category');
plt.ylabel('Predicted TPLE category');
if save_figures:
    plt.savefig('fig_tpot_confusionmatrix.png', dpi = 300)
In [95]:
print('Evaluation for fitted TPOT pipeline:\n\n', tpot_model)
# accuracy score
tpot_accscore = accuracy_score(y_test, y_pred_tpot)
print('\n\nAccuracy score: {:.2f}'.format(tpot_accscore))
print('\n\nClassification report: \n\n', classification_report(y_test, y_pred_tpot, target_names = conf_mat_labels))
# cross val score
tpot_cvscore = cross_val_score(tpot_model, X_test, y_test, scoring = 'accuracy', cv = 10)
print('\n\nCross val score (std): {:.2f} ({:.2f})'.format(tpot_cvscore.mean(), tpot_cvscore.std()))
As a quick glance at deep learning we will apply a Multilayer Perceptron (MLP) for multi-class softmax classification, using stochastic gradient descent as the optimizer. The code is borrowed from the official Keras Sequential model guide and adapted where needed.
In [59]:
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.optimizers import SGD
from keras.utils.np_utils import to_categorical
In [64]:
keras_model = Sequential()
keras_model.add(Dense(64, activation='relu', input_dim=len(elec_features.columns)))
keras_model.add(Dropout(0.5))
keras_model.add(Dense(64, activation='relu'))
keras_model.add(Dropout(0.5))
keras_model.add(Dense(nr_of_y_categories, activation='softmax'))
sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
keras_model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
keras_model.fit(X_train, to_categorical(y_train), epochs=100, batch_size=10)
y_pred_keras = keras_model.predict(X_test)
scores = keras_model.evaluate(X_test, to_categorical(y_test), batch_size=128)
print('\n\n{}: {:.2f}'.format(keras_model.metrics_names[1], scores[1]))
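Note that predict on this softmax model returns class probabilities rather than labels. A small sketch (using the le encoder fitted earlier) to turn the output into TPLE categories comparable with the other models:

# convert softmax probabilities to class indices and map them back to TPLE category labels
keras_class_idx = np.argmax(y_pred_keras, axis = 1)
keras_labels = le.inverse_transform(keras_class_idx)
print('First 10 predicted TPLE categories:', keras_labels[:10])
print('MLP accuracy on the test set: {:.2f}'.format(accuracy_score(y_test, keras_class_idx)))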
In [36]:
from keras.utils import plot_model
if save_figures:
    plot_model(keras_model, show_shapes = True, to_file = 'fig_deeplearning_keras_mlp.png')
We compared three different machine learning approaches (manual, autoML and deep learning) to predict SEEG electrode implantation accuracy. Although I consider the results not bad for a start, there is definitely more work to do (on more data!) to reach a point at which a model like this could be implemented in e.g. a planning system to assist the surgeon in avoiding trajectories that are predicted to have a high deviation.